Before performing the analysis, several R-based packages should be installed on your local machine. Note that the packages may require an additional installation of other dependencies. Make sure that you have installed recommended version of the packages and your machine meets the mimimal system requirements for carrying out this analysis.
library(scater)
library(M3Drop)
library(monocle)
library(Seurat)
library(mclust)
library(scran)
library(SC3)
Zeisel2015 is a UMI-based scRNAseq dataset published in Zeisel et al. (2015) (Link to study: http://science.sciencemag.org/content/347/6226/1138) that contains the expression matrix of endogenous, spike-in and mitochondrial genes from 3005 cells from mouse cortex and hippocampus tissues. Additional metadata information are available, including cells origin, type or tissue. For the purpose of this tutorial we stored Zeisel2015 count matrix with metadata information in two separate txt files. We can read these files and create an easy-to-work SingleCellExperiment experiment object.
all.counts <- read.table(file ="data/all.counts.txt", sep = "\t")
metadata <- read.table(file ="data/metadata.txt", sep = "\t", header = TRUE)
sce <- SingleCellExperiment(list(counts=as.matrix(all.counts)), colData=metadata)
sce
## class: SingleCellExperiment
## dim: 19896 3005
## metadata(0):
## assays(1): counts
## rownames(19896): Tspan12 Tshz1 ... ERCC-00170 ERCC-00171
## rowData names(0):
## colnames(3005): V3 V4 ... V3006 V3007
## colData names(10): tissue group ... level1class level2class
## reducedDimNames(0):
## spikeNames(0):
We can explore the information about cells and genes stored in SingleCellExperiment experiment object under colData and rawData slots, respectively.
counts(sce)[1:4,1:4]
## V3 V4 V5 V6
## Tspan12 0 0 0 3
## Tshz1 3 1 0 2
## Fnbp1l 3 1 6 4
## Adamts15 0 0 0 0
names(colData(sce)) #available metadata information
## [1] "tissue" "group" "mRNA" "well" "sex"
## [6] "age" "diameter" "cellId" "level1class" "level2class"
table(colData(sce)$tissue) #tissue annotation
##
## ca1hippocampus sscortex
## 1314 1691
table(colData(sce)$level1class) #cell type annotation
##
## astrocytes_ependymal endothelial-mural interneurons
## 224 235 290
## microglia oligodendrocytes pyramidal CA1
## 98 820 939
## pyramidal SS
## 399
names(rowData(sce))
## character(0)
As mentioned, Zeisel2015 dataset contains also the expression of spike-ins and mitochondrial genes. Spike-in gene names usually starts from “ERCC” and mitochondrial genes contain “mt” string.
is.spike <- grepl("^ERCC", rownames(sce))
summary(is.spike)
## Mode FALSE TRUE
## logical 19839 57
is.mito <- grepl("^mt-", rownames(sce))
summary(is.mito)
## Mode FALSE TRUE
## logical 19862 34
We can use scater package to automatically calculate several quality metrics per cell and per gene. These metrics will be useful to i.e. detect low quality cells or lowly expressed genes that should be removed from further analysis. Remember to not define spike-ins in feature_controls if your dataset do not contain any ERCC genes.
sce <- calculateQCMetrics(sce, feature_controls=list(Spike=is.spike, Mt=is.mito))
colnames(colData(sce))
## [1] "tissue"
## [2] "group"
## [3] "mRNA"
## [4] "well"
## [5] "sex"
## [6] "age"
## [7] "diameter"
## [8] "cellId"
## [9] "level1class"
## [10] "level2class"
## [11] "is_cell_control"
## [12] "total_features_by_counts"
## [13] "log10_total_features_by_counts"
## [14] "total_counts"
## [15] "log10_total_counts"
## [16] "pct_counts_in_top_50_features"
## [17] "pct_counts_in_top_100_features"
## [18] "pct_counts_in_top_200_features"
## [19] "pct_counts_in_top_500_features"
## [20] "total_features_by_counts_endogenous"
## [21] "log10_total_features_by_counts_endogenous"
## [22] "total_counts_endogenous"
## [23] "log10_total_counts_endogenous"
## [24] "pct_counts_endogenous"
## [25] "pct_counts_in_top_50_features_endogenous"
## [26] "pct_counts_in_top_100_features_endogenous"
## [27] "pct_counts_in_top_200_features_endogenous"
## [28] "pct_counts_in_top_500_features_endogenous"
## [29] "total_features_by_counts_feature_control"
## [30] "log10_total_features_by_counts_feature_control"
## [31] "total_counts_feature_control"
## [32] "log10_total_counts_feature_control"
## [33] "pct_counts_feature_control"
## [34] "pct_counts_in_top_50_features_feature_control"
## [35] "pct_counts_in_top_100_features_feature_control"
## [36] "pct_counts_in_top_200_features_feature_control"
## [37] "pct_counts_in_top_500_features_feature_control"
## [38] "total_features_by_counts_Spike"
## [39] "log10_total_features_by_counts_Spike"
## [40] "total_counts_Spike"
## [41] "log10_total_counts_Spike"
## [42] "pct_counts_Spike"
## [43] "pct_counts_in_top_50_features_Spike"
## [44] "pct_counts_in_top_100_features_Spike"
## [45] "pct_counts_in_top_200_features_Spike"
## [46] "pct_counts_in_top_500_features_Spike"
## [47] "total_features_by_counts_Mt"
## [48] "log10_total_features_by_counts_Mt"
## [49] "total_counts_Mt"
## [50] "log10_total_counts_Mt"
## [51] "pct_counts_Mt"
## [52] "pct_counts_in_top_50_features_Mt"
## [53] "pct_counts_in_top_100_features_Mt"
## [54] "pct_counts_in_top_200_features_Mt"
## [55] "pct_counts_in_top_500_features_Mt"
colData(sce)$total_counts[1:5] #Total count for each cell
## [1] 29060 29304 38929 40557 27625
colData(sce)$total_features_by_counts[1:5] #Number of genes with non zero counts per cell
## [1] 4898 4750 6090 5887 4753
if(length(is.spike==TRUE)>0){colData(sce)$pct_counts_Spike[1:5]} #Percentage of spike-in counts per cell
## [1] 23.07639 21.95946 16.27322 17.33856 21.46968
colnames(rowData(sce))
## [1] "is_feature_control" "is_feature_control_Spike"
## [3] "is_feature_control_Mt" "mean_counts"
## [5] "log10_mean_counts" "n_cells_by_counts"
## [7] "pct_dropout_by_counts" "total_counts"
## [9] "log10_total_counts"
rowData(sce)$mean_counts[1:5] #Average count per gene
## [1] 0.27886855 0.42362729 1.04292845 0.04459235 0.37936772
rowData(sce)$n_cells_by_counts[1:5] #Number of cells with non zero counts per gene
## [1] 472 573 1167 77 669
rowData(sce)$pct_dropout_by_counts[1:5] #Percentage of dropouts per gene
## [1] 84.29285 80.93178 61.16473 97.43760 77.73710
One can plot histograms based on quality metrics to determine possible thresholds for cell/gene filterings. We can also visualize some of the quality metrics on the PCA projected data.
assay(sce, "logcounts_raw") <- log2(counts(sce)+1)
hist(sce$total_counts, breaks=50, xlab="Library size", ylab="Number of cells")
plotPCA(sce, run_args=c(exprs_values="logcounts_raw"), size_by="total_counts")
hist(sce$total_features_by_counts, xlab="Number of expressed genes", breaks=50, ylab="Number of cells")
plotPCA(sce, run_args=c(exprs_values="logcounts_raw"), size_by="total_features_by_counts")
if(length(is.spike==TRUE)>0)
{hist(sce$pct_counts_Spike, xlab="ERCC proportion (%)", ylab="Number of cells", breaks=50)
plotPCA(sce, run_args=c(exprs_values="logcounts_raw"), size_by="pct_counts_Spike")}
Alternatively, isOutlier function provides an automatic detection of low quality cells based on quality metrics. In this example, we are filtering cells based on library size, number of expressed genes per cell and total count over all spike-in transcripts in each cell. Excluded cells are those with the total number of expressed genes and the total sum of counts more than 3 median absolute deviations below the median across the genes. Note that we dont filter many cells in this step, it is likely that dataset was already quality controlled by the original study.
libsize.drop <- isOutlier(sce$total_counts, nmads=3, type="lower", log=TRUE)
feature.drop <- isOutlier(sce$total_features_by_counts, nmads=3, type="lower", log=TRUE)
if(length(is.spike==TRUE)>0){spike.drop <- isOutlier(sce$pct_counts_Spike, nmads=3, type="higher")
ind <- libsize.drop | feature.drop | spike.drop}else{ind <- libsize.drop | feature.drop}
sce$removed=ind
plotPCA(sce, colour_by="removed", run_args=c(exprs_values="logcounts_raw"), shape_by="removed")
if(length(is.spike==TRUE)>0)
{sce <- sce[,!(libsize.drop | feature.drop | spike.drop)]
data.frame(ByLibSize=sum(libsize.drop), ByFeature=sum(feature.drop), BySpike=sum(spike.drop), Remaining=ncol(sce))
}else{sce <- sce[,!(libsize.drop | feature.drop)]
data.frame(ByLibSize=sum(libsize.drop), ByFeature=sum(feature.drop), Remaining=ncol(sce))
}
## ByLibSize ByFeature BySpike Remaining
## 1 8 3 8 2989
We can also explore highly and overexpressed genes. The overexpressed genes, are those genes which expression is higher than others by several magnitudes. Overexpressed genes can bias further procedures such as clustering, therefore, they should be removed before performing downstream analysis. Note that highly expressed genes mostly include spike-ins and mitochondrial genes. One of the most commonly overexpressed endogenous gene is “Malat1”.
plotHighestExprs(sce, exprs_values = "counts")
In this step we will filter spike-in and mitochondrial genes as well as “Malat1”.
if(length(is.spike==TRUE)>0){sce=sce[-which(grepl("^ERCC-", rownames(sce))),]}
sce=sce[-which(grepl("^mt-", rownames(sce))),]
sce <- sce[-which(rownames(sce)=="Malat1"),]
dim(sce)
## [1] 19804 2989
Now, we should filter out lowly expressed genes as they do not provide any insights into the underlying biology. In the filtering we removed lowly expressed genes that are genes with average expression count (adjusted by library size) equal to 0. Note that we dont filter many genes in this step, it is likely that dataset was already quality controlled by the original study.
rowData(sce)$ave.counts <- calcAverage(sce, exprs_values = "counts", use_size_factors=FALSE)
to.keep <- rowData(sce)$ave.counts > 0
sce <- sce[to.keep,]
summary(to.keep)
## Mode FALSE TRUE
## logical 2 19802
dim(sce)
## [1] 19802 2989
After quality control and filtering, one should normalize the dataset to remove potential technical bias. One strategy is to use CPM (Count Per Million) normalization which is a correction to remove the noise related to sequencing depth. CPM divides each count by its total sum (across all the genes) and multiplies by one million. In this way, each cell has the same total sum of the counts. We can plot PCA on the log-transformed CPM normalized counts and compare with previous PCA on the log raw counts.
cpm(sce) <- calculateCPM(sce, use_size_factors=FALSE)
cpm(sce)[1:4,1:4]
## V3 V4 V5 V6
## Tspan12 0.0000 0.00000 0.000 94.75081
## Tshz1 144.5226 47.89501 0.000 63.16720
## Fnbp1l 144.5226 47.89501 197.336 126.33441
## Adamts15 0.0000 0.00000 0.000 0.00000
assay(sce, "logcounts") <- log2(cpm(sce)+1)
plotPCA(sce,colour_by="level1class",run_args=c(exprs_values="logcounts"))
set.seed(100)
plotTSNE(sce, colour_by="level1class", run_args=c(exprs_values = "logcounts"))
plotUMAP(sce, colour_by="level1class", run_args=c(exprs_values = "logcounts"))
Another strategy, more suitable for scRNAseq data, is deconvolution method implemented in scran package. The deconvolution method normalizes data by cells-pooled size factors that account for dropout biases. More details about this normalization technique can be found in study https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0947-7. Similarly, we will plot PCA on the log scran normalized counts.
set.seed(100)
clusters <- quickCluster(sce, min.size=200, min.mean=0.1, method="igraph")
sce <- computeSumFactors(sce, cluster=clusters, min.mean=0.1)
sce <- scater::normalize(sce, return_log = FALSE)
normcounts(sce)[1:4,1:4]
## V3 V4 V5 V6
## Tspan12 0.000000 0.0000000 0.000000 1.278561
## Tshz1 1.894734 0.6425685 0.000000 0.852374
## Fnbp1l 1.894734 0.6425685 2.582446 1.704748
## Adamts15 0.000000 0.0000000 0.000000 0.000000
assay(sce, "logcounts") <- log2(normcounts(sce)+1)
plotPCA(sce,colour_by="level1class",run_args=c(exprs_values="logcounts"))
To deal with a large number of dimensions in scRNAseq dataset a feature selection or dimension reduction strategies can be applied.
Feature selection step is aimed at preserving biologically relevant information in the dataset and improving computational efficiency of downstream analyses. In most of the cases, it seeks for highly variable genes based on mean/variance relationship. M3Drop is one of the packages for automatic detection of most variable genes. It first calculates the mean and square coefficient of variation and fits the quadratic curve between the two variables. Then a chi-square test is used to find high variable genes which are significantly above the curve. We found CPM adjusted counts more appriopriate to use as input for this method rather than scran normalized values. Note that when using older versions of M3Drop package (<1.10.0) the output is an array of genes instead of gene table with p-values.
hvg_table <- BrenneckeGetVariableGenes(cpm(sce), fdr = 0.01, minBiolDisp = 0.5)
length(hvg_table$Gene)
## [1] 8183
sce_sub=sce[as.character(hvg_table$Gene),]
Dimension reduction can be used to visualize basic structure of the data or to reduce the number of features prior downstream analysis such as clustering. In contrast to the feature selection which extracts the most informative features, dimension reduction techniques projects the data into a new low-dimensional space that preserves the structure of the data. To reduce dataset dimensionality one can use previously mentioned PCA or more novel techniques such as tSNE or UMAP (all implemented in scater package). Note that projections should be applied to normalized and log-transformed count matrices. PCA is a deterministic approach while tSNE and UMAP are stochastic - for reproducibility of tSNE and UMAP projections one has to set the seed for generating random variables.
assay(sce_sub, "logcounts") <- log2(cpm(sce_sub)+1)
plotPCA(sce_sub, colour_by="level1class", run_args=c(exprs_values = "logcounts"))
set.seed(100)
plotTSNE(sce_sub, colour_by="level1class", run_args=c(exprs_values = "logcounts"))
plotUMAP(sce_sub, colour_by="level1class", run_args=c(exprs_values = "logcounts"))
To cluster cells we can use SC3 package. SC3 is a consensus clustering method for single-cell RNA-seq data. It first calculates distances between the cells using three metrics: Euclidean, Pearson, and Spearman. On each of the obtained distance matrix two transformations are applied: PCA and Laplacian graph. Then K-Means clustering technique is used to cluster the transformed distance matrices subject to the first d eigenvectors. In result, several individual clusterings are obtained which are further combined into a single consensus clustering using Cluster-based Similarity Partitioning Algorithm (CSPA). In this example we will use available cell annotation in the clustering inference of SC3. In the real life applications we usually do not know the exact number populations in the sample. Note that clustering cells with SC3 may take a substantial amount of time. We will visualize results on the PCA projected space.
k_input=length(unique(sce$level1class))
rowData(sce)$feature_symbol <- rownames(sce)
sce <- sc3(sce, ks = k_input, gene_filter = FALSE, biology = TRUE, n_cores=4)
table(colData(sce)[paste0("sc3_",k_input, "_clusters")][[1]])
##
## 1 2 3 4 5 6 7
## 960 8 708 49 517 311 436
assay(sce, "logcounts") <- log2(cpm(sce)+1)
plotTSNE(sce, colour_by=paste0("sc3_",k_input, "_clusters"), run_args=c(exprs_values = "logcounts"))
If the cell annotation is available, one can measure the effectiveness of the clustering output using Adjusted Rand Index (ARI). ARI index measures similarity between the partition obtained from clustering and the partition obtained from dataset annotation. The values of the ARI range can be negative if the agreement of the partitions is worse then the agreement expected by chance, or between 0 and 1 for clustering better then chance. ARI close to 1 indicate high accuracy of the method in detecting annotated cell populations.
adjustedRandIndex(colData(sce)[paste0("sc3_",k_input, "_clusters")][[1]],sce$level1class)
## [1] 0.7940186
How the clustering performance would change if you would estimate the number of cell populations in SC3 using sc3_estimate_k function? Is the number of estimated clusters close to the annotated number of cell populations ?
You can use colour_by argument to color points in any of the above projections by a selected marker gene or any information stored in the annotation. Additional violin plots allow you to i.e. visualize the expression of the selected gene across all the cells from a given cell type. For the illustrative purpose, we used Gad1 gene taken from the literature, which is a marker for interneuron cell type in the cortex tissue.
set.seed(100)
plotTSNE(sce, colour_by="Gad1", run_args=c(exprs_values = "logcounts"))
plotExpression(sce, "Gad1", x = "level1class", colour_by = "level1class", exprs_values = "logcounts")
Seurat is an R package designed for QC, analysis, and exploration of single-cell RNA-seq data. All procedures are based on Seurat object which can store the information about counts and dataset annotation. Note that when creating new Seurat object one can perform cell and gene filterings based on specified thresholds. Here we set both filterings to 0, as we will input already quality controlled and filtered counts.
Note that in Seurat versions <3.1.1 slot counts is relaced by raw.data, slot min.features is replaced by min.genes, and instead of seur@assays$RNA@counts user access the data by calling seur@data.
seur <- CreateSeuratObject(counts = counts(sce), meta.data = data.frame(colData(sce)), min.cells = 0, min.features = 0)
seur
## An object of class Seurat
## 19802 features across 2989 samples within 1 assay
## Active assay: RNA (19802 features)
seur@assays$RNA@counts[1:4,1:4]
## 4 x 4 sparse Matrix of class "dgCMatrix"
## V3 V4 V5 V6
## Tspan12 . . . 3
## Tshz1 3 1 . 2
## Fnbp1l 3 1 6 4
## Adamts15 . . . .
Seurat also provides function to normalize data by a scaling factor followed by log-transformation.
seur <- NormalizeData(seur, normalization.method = "LogNormalize")
seur@assays$RNA@data[1:4,1:4]
## 4 x 4 sparse Matrix of class "dgCMatrix"
## V3 V4 V5 V6
## Tspan12 . . . 0.6665506
## Tshz1 0.8941375 0.3913325 . 0.4896053
## Fnbp1l 0.8941375 0.3913325 1.089693 0.8168434
## Adamts15 . . . .
For extracting most informative genes one can use FindVariableGenes function. When using this function one can change the selection method (selection.method). For the illustrative purpose, we will use vst method which fits a line to the relationship of log(variance) and log(mean) using local polynomial regression (loess). Then standardizes the gene expression values using the observed mean and expected variance (given by the fitted line). Gene variance is then calculated on the standardized values after clipping to a maximum (clip.max). For more details see help of FindVariableGenes function.
Note that in Seurat versions <3.1.1 function FindVariableFeatures is relaced by FindVariableGenes and there is no function VariableFeaturePlot.
seur <- FindVariableFeatures(object = seur, selection.method="vst", nfeatures = 2000)
top10 <- head(VariableFeatures(seur), 10)
plot1 <- VariableFeaturePlot(seur)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot2
For centering and scaling dataset you can use ScaleData function. Setting do.center=TRUE will center the expression for each gene by subtracting the average expression for that gene. Setting do.scale=TRUE will scale the expression for each gene by dividing the centered gene expression levels by their standard deviations (if do.center=TRUE), and by their root mean square, otherwise.
seur <- ScaleData(object = seur, do.center = TRUE, do.scale = TRUE)
seur@assays$RNA@scale.data[1:4,1:4]
## V3 V4 V5 V6
## Tshz1 1.6518806 0.4900755 -0.4141557 0.71714932
## 2310042E22Rik -0.1752873 -0.1752873 2.5351143 3.39875734
## Sema3c 2.7277730 -0.4150941 3.3787651 0.05362024
## Jam2 0.6895176 -0.3571584 0.3997203 -0.35715840
In this step, we will perform PCA dimension reduction, build nearest neighbor graph (on the Euclidean distances between any pair of cells in the PCA space) and clusterize cells using modularity optimization technique (Louvain algorithm) that seeks densily connected modules (cell clusters) in a graph structure. Note that FindClusters do not require from the user to input true number of cell populations. However the clustering performance strongly depends on the resolution parameter that controls the size of graph communities, hence the size of the clusters (for more details see the description of FindClusters function).
seur <- RunPCA(object = seur, pc.genes=seur@var.genes, ndims.print = 1:2, nfeatures.print = 5) #Reducing dimension with PCA
seur <- FindNeighbors(seur) #Computing nearest neighbor graph
seur <- FindClusters(object = seur, resolution = 0.1, algorithm = 1) #Clustering cells with modularity detection algorithm on graph
table(Idents(seur))
##
## 0 1 2 3 4 5 6 7 8
## 787 676 403 342 320 178 160 63 60
We can use UMAP dimension reduction to compare the Seurat clustering and annotated cell populations.
seur <- RunUMAP(seur, dims = 1:10)
DimPlot(seur, group.by = "ident", reduction.use = "umap")
DimPlot(seur, group.by = "level1class", reduction.use = "umap")
Finally, we will verify the accuracy of the clustering using ARI index.
adjustedRandIndex(Idents(seur), seur@meta.data$level1class)
## [1] 0.6986632
How the clustering would change if you would manipulate the resolution parameter?
Here we will use Monocle to perform a complete analysis on the example dataset. Monocle uses its own data object - CellDataSet, which can store the information about the experiment (expression matrix and metadata). Note that when creating new CellDataSet object the data should not be normalized - the method will normalize matrix internally when performing downstream analysis steps. It is suggested to set expressionFamily=negbinomial.size() when working with UMI or raw read counts to define the underlying distribution of the dataset. For more details see: http://cole-trapnell-lab.github.io/monocle-release/docs/#installing-monocle.
rowData(sce)$gene_short_name <- rownames(sce)
pd <- new("AnnotatedDataFrame", data = data.frame(colData(sce))) #cell info
fd <- new("AnnotatedDataFrame", data = data.frame(rowData(sce))) #gene info
cds <- newCellDataSet(cellData = counts(sce), phenoData = pd, featureData = fd, expressionFamily=VGAM::negbinomial.size())
cds
## CellDataSet (storageMode: environment)
## assayData: 19802 features, 2989 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: V3 V4 ... V3007 (2989 total)
## varLabels: tissue group ... Size_Factor (59 total)
## varMetadata: labelDescription
## featureData
## featureNames: Tspan12 Tshz1 ... Gm21943 (19802 total)
## fvarLabels: is_feature_control is_feature_control_Spike ...
## gene_short_name (17 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
In the first step, we will call estimateDispersions function to compute a smooth curve describing the relationship between the variance of the genes and the mean expression across all the cells. dispersionTable retrieves table of computed mean, empirical dispersion and fitted dispersion values. From this table we can subset top variable genes by thresholding on the mean and empirical dispersion. Black dots on the mean/dispersion figure marks the selected features.
cds <- BiocGenerics::estimateSizeFactors(cds)
cds <- estimateDispersions(cds)
disp <- dispersionTable(cds)
head(disp)
## gene_id mean_expression dispersion_fit dispersion_empirical
## 1 Tspan12 0.27716042 5.633211 10.023476
## 2 Tshz1 0.41719267 3.973567 8.826994
## 3 Fnbp1l 0.83724181 2.325529 3.491744
## 4 Adamts15 0.04898606 28.664478 83.258508
## 5 Cldn12 0.30621583 5.164049 5.294964
## 6 Rxfp1 0.03631543 38.425363 12.890793
hvg <- subset(disp, mean_expression >= 0.5 & dispersion_empirical >= 0.1)
cds <- setOrderingFilter(cds, ordering_genes = hvg$gene_id)
plot_ordering_genes(cds)
cds <- cds[hvg$gene_id,]
Before performing clustering, we will reduce the feature selected matrix down to two dimensions using tSNE. Note that Monocle3 also provides UMAP as reduction_method.
cds <- reduceDimension(cds, max_components = 2, reduction_method = 'tSNE', verbose = T)
Unsupervised clustering in Monocle can be done using one of two techniques, densityPeak or louvain. densityPeak partitions cells based on the cell local density and their nearest distance. It aims to group cells with a higher local density and distance smaller than a fixed threshold. The number of clusters in this technique is automatically estimated through the inference. On the contrary, louvain approach is based on a community detection algorithm that seeks densely connected modules in a graph built on scRNAseq dataset. Although it is possible to fix the number of nearest neighbors when creating the graph, the number of clusters cannot be directly controlled. For the illustrative purpose, we will clusterize cells with densityPeak algorithm.
cds <- clusterCells(cds, num_clusters = NULL, method = "densityPeak")
## Distance cutoff calculated to 3.26211
table(cds$Cluster)
##
## 1 2 3 4 5
## 740 667 781 518 283
We can plot now the projected data colored by the clustering output and the available cell annotation. Note that plot will depend on the reduction_method you used previously (either tSNE or UMAP projected data will be plotted).
p1 <- plot_cell_clusters(cds, color_by = 'Cluster')
p2 <- plot_cell_clusters(cds, color_by = 'level1class')
p1
p2
Finally, we will verify the accuracy of the clustering using ARI index.
adjustedRandIndex(cds$Cluster,cds$level1class)
## [1] 0.6961306
Would you improve the performance of the method by setting method = “louvain” ? What is the difference between the number of estimated clusters in comparison to densityPeak approach?
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
## LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] SC3_1.10.1 scran_1.10.2
## [3] mclust_5.4.3 Seurat_3.1.1
## [5] monocle_2.99.2 L1Graph_0.1.1
## [7] lpSolveAPI_5.5.2.0-17 DDRTree_0.1.5
## [9] irlba_2.3.3 igraph_1.2.4
## [11] Matrix_1.2-17 M3Drop_1.10.0
## [13] numDeriv_2016.8-1 scater_1.10.1
## [15] ggplot2_3.1.0 SingleCellExperiment_1.4.1
## [17] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
## [19] BiocParallel_1.16.6 matrixStats_0.54.0
## [21] Biobase_2.42.0 GenomicRanges_1.34.0
## [23] GenomeInfoDb_1.18.2 IRanges_2.16.0
## [25] S4Vectors_0.20.1 BiocGenerics_0.28.0
##
## loaded via a namespace (and not attached):
## [1] reticulate_1.11.1 R.utils_2.8.0
## [3] tidyselect_0.2.5 htmlwidgets_1.3
## [5] grid_3.5.1 combinat_0.0-8
## [7] docopt_0.6.1 Rtsne_0.15
## [9] munsell_0.5.0 codetools_0.2-16
## [11] ica_1.0-2 units_0.6-2
## [13] umap_0.2.0.0 statmod_1.4.30
## [15] future_1.14.0 miniUI_0.1.1.1
## [17] withr_2.1.2 colorspace_1.4-1
## [19] fastICA_1.2-1 knitr_1.22
## [21] rstudioapi_0.10 ROCR_1.0-7
## [23] robustbase_0.93-4 pbmcapply_1.3.1
## [25] gbRd_0.4-11 listenv_0.7.0
## [27] labeling_0.3 Rdpack_0.10-1
## [29] slam_0.1-45 bbmle_1.0.20
## [31] GenomeInfoDbData_1.2.0 pheatmap_1.0.12
## [33] rhdf5_2.26.2 coda_0.19-2
## [35] LearnBayes_2.15.1 xfun_0.5
## [37] R6_2.4.0 doParallel_1.0.14
## [39] ggbeeswarm_0.6.0 rsvd_1.0.2
## [41] VGAM_1.1-1 locfit_1.5-9.1
## [43] manipulateWidget_0.10.0 bitops_1.0-6
## [45] assertthat_0.2.1 promises_1.0.1
## [47] SDMTools_1.1-221 scales_1.0.0
## [49] nnet_7.3-12 beeswarm_0.2.3
## [51] gtable_0.3.0 npsurv_0.4-0
## [53] Cairo_1.5-10 globals_0.12.4
## [55] rlang_0.4.0 splines_3.5.1
## [57] lazyeval_0.2.2 acepack_1.4.1
## [59] checkmate_1.9.1 rgl_0.100.19
## [61] yaml_2.2.0 reshape2_1.4.3
## [63] crosstalk_1.0.0 backports_1.1.3
## [65] httpuv_1.5.0 Hmisc_4.2-0
## [67] tools_3.5.1 spData_0.3.0
## [69] gplots_3.0.1.1 RColorBrewer_1.1-2
## [71] dynamicTreeCut_1.63-1 ggridges_0.5.1
## [73] Rcpp_1.0.1 plyr_1.8.4
## [75] base64enc_0.1-3 zlibbioc_1.28.0
## [77] classInt_0.3-1 purrr_0.3.2
## [79] RCurl_1.95-4.12 densityClust_0.3
## [81] rpart_4.1-13 deldir_0.1-16
## [83] reldist_1.6-6 pbapply_1.4-0
## [85] viridis_0.5.1 cowplot_0.9.4
## [87] zoo_1.8-5 ggrepel_0.8.0
## [89] cluster_2.0.7-1 magrittr_1.5
## [91] RSpectra_0.13-1 data.table_1.12.0
## [93] gmodels_2.18.1 lmtest_0.9-36
## [95] RANN_2.6.1 mvtnorm_1.0-10
## [97] fitdistrplus_1.0-14 lsei_1.2-0
## [99] mime_0.6 evaluate_0.13
## [101] xtable_1.8-3 sparsesvd_0.1-4
## [103] gridExtra_2.3 HSMMSingleCell_1.2.0
## [105] compiler_3.5.1 tibble_2.1.1
## [107] KernSmooth_2.23-15 crayon_1.3.4
## [109] R.oo_1.22.0 htmltools_0.3.6
## [111] pcaPP_1.9-73 mgcv_1.8-28
## [113] later_0.8.0 spdep_1.0-2
## [115] Formula_1.2-3 rrcov_1.4-7
## [117] tidyr_0.8.3 expm_0.999-4
## [119] RcppParallel_4.4.2 DBI_1.0.0
## [121] WriteXLS_4.1.0 MASS_7.3-51.3
## [123] sf_0.7-3 boot_1.3-20
## [125] R.methodsS3_1.7.1 gdata_2.18.0
## [127] metap_1.1 pkgconfig_2.0.2
## [129] registry_0.5-1 foreign_0.8-71
## [131] sp_1.3-1 plotly_4.8.0
## [133] foreach_1.4.4 vipor_0.4.5
## [135] rngtools_1.3.1 pkgmaker_0.27
## [137] webshot_0.5.1 XVector_0.22.0
## [139] bibtex_0.4.2 doRNG_1.7.1
## [141] stringr_1.4.0 digest_0.6.18
## [143] tsne_0.1-3 sctransform_0.2.0
## [145] RcppAnnoy_0.0.11 rmarkdown_1.12
## [147] leiden_0.3.1 htmlTable_1.13.1
## [149] edgeR_3.24.3 uwot_0.1.4
## [151] DelayedMatrixStats_1.4.0 shiny_1.2.0
## [153] gtools_3.8.1 nlme_3.1-137
## [155] jsonlite_1.6 Rhdf5lib_1.4.3
## [157] BiocNeighbors_1.0.0 viridisLite_0.3.0
## [159] limma_3.38.3 pillar_1.4.2
## [161] lattice_0.20-38 ggrastr_0.1.7
## [163] DEoptimR_1.0-8 httr_1.4.0
## [165] survival_2.44-1.1 glue_1.3.1
## [167] qlcMatrix_0.9.7 FNN_1.1.3
## [169] png_0.1-7 iterators_1.0.10
## [171] glmnet_2.0-16 class_7.3-15
## [173] stringi_1.4.3 HDF5Array_1.10.1
## [175] latticeExtra_0.6-28 caTools_1.17.1.2
## [177] dplyr_0.8.0.1 e1071_1.7-1
## [179] future.apply_1.3.0 ape_5.3